Using Public Data and Maps for Powerful Data Visualizations

Joy Payton

About This Session

Please follow along… I include links here you’ll want to click!

The second section is strongly USA-centric, but the rest is globally applicable.

  • Section 1: APIs
  • Section 2: US Census data
  • Break / Q&A
  • Section 3: Socrata public data portals
  • Section 4: Other public data portals
  • Break / Q&A
  • Section 5: Mapping data using leaflet
  • Q&A / Next Steps

About Your Instructor

Joy Payton (she/her) is…

  • Data Scientist
  • Data Educator (so I know a little about a lot)
  • Research Reproducibility Advocate
  • Census Nerd (first job after undergrad!)
  • Map Fan
  • Political Junkie

About Your Instructor

Joy Payton (she/her) is NOT …

  • GIS Wizard
  • Demographer
  • Physician
  • Statistician

You Can Reach Me…

Section 1: APIs

What are APIs and why do they matter? We’ll be talking about the Census API, the SODA API, and other API endpoints in this session, so a short intro to APIs is in order.

What is an API?

API stands for Application Programming Interface. It’s a way for people or computers to interact with software in a prescribed way. A common type of web API is based on the REST architecture; such APIs are often referred to as “RESTful APIs”.

A RESTful API is “resource-oriented”: URLs (web addresses) map to objects or resources that you can then interact with (like CSVs of data).
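To make “resource-oriented” concrete, here’s a minimal sketch in Python. The host and path are made up (echoing the fake.site example later in this deck), and the CSV payload is stubbed in so the snippet runs without a network call:

```python
import csv
import io

# Hypothetical RESTful endpoint: each path segment names a resource.
BASE = "https://fake.site/api"  # made-up host, for illustration only
resource_url = f"{BASE}/hospitals/chop/admissions.csv"

# In real code you would fetch the resource, e.g.:
#   import urllib.request
#   raw = urllib.request.urlopen(resource_url).read().decode("utf-8")
# Here we stand in a tiny CSV payload so the example is self-contained.
raw = "date,admissions\n2022-08-01,12\n2022-08-02,9\n"

# The resource behaves like any other file of data once retrieved.
rows = list(csv.DictReader(io.StringIO(raw)))
```

The point is simply that a well-designed URL *is* the interface: ask for the same resource the same way, get the same kind of answer back.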

Why Use an API?

Why use APIs? They provide a structured, consistent way to carry out a process so that it can be automated and standardized.

API Use Cases

Specific API use cases might include:

  • Getting the latest, most up-to-date data for an application that counts hospital admissions for influenza
  • Requesting, on a news site, only news stories related to public health in Ghana
  • Communicating between code, such as sending success messages or heartbeats

In all of these cases, you want to get predictable results using a method that’s easy to reproduce. Let’s concentrate for now on the use case of getting fresh data via an API.

APIs Mean Fresh Data!

While many data-centric applications allow you to download data by using a form submission or clicking on buttons that save data to your computer, that might not be the most useful way to work with data in an ongoing way.

If your data might change regularly, with more data being added, it’s probably smart to add a few lines to your code that get the latest data, instead of depending on a potentially stale CSV in a folder on your computer.

The Alternative… is Awful

  • Go to https://fake.site/data …
  • Log in using the username:mike and password:mypassword
  • Make sure you’ve checked the following boxes in the data request page: …
  • Save the file with the following naming convention: …
  • Store the file in the sharefile folder within the directory called …

Section 2: US Census Data

Exercise 1

First, let’s start by having you clone the materials for the workshop to your own computer.

  • Go to https://github.com/pm0kjp/r-medicine-2022/ and click on the green “Code” button to either:
  • Clone the repository (if you already use git, you know how this works) or
  • Download the zipped contents (if you’re not a git user)

Exercise 1

Once you have these files downloaded, you will notice the following file structure:

./
├── 📁 data
├── 📁 scripts
├── .gitattributes
├── .gitignore
└── README.md

Please add a folder called private at the same level as data and scripts, so that it looks like this:

./
├── 📁 data
├── 📁 private
├── 📁 scripts
├── .gitattributes
├── .gitignore
└── README.md

Decennial Census

The US Census Bureau is bound by the Constitution to conduct a full (not sampled) census of all people within the US every ten years. The results determine the number of seats in the US House of Representatives and are used to draw district boundaries. This is the Decennial Census.

American Community Survey

In addition to the full population census, the Census Bureau is also responsible for conducting the American Community Survey (ACS) which uses sampling and inferential statistics to make estimates of social factors that affect your patients and research subjects… neighborhood characteristics like:

  • Education levels, demographic characteristics
  • Poverty rates, mean and median income
  • Computer usage, housing characteristics
  • Crime, commuting, and much more!

Versions of the ACS

Note that the ACS comes in one-year and five-year versions. Five-year ACS data includes estimates for the entire country, while the one-year version covers only more populous areas and uses smaller sample sizes.

Other Census Work

There are additional censuses performed by the Census Bureau that we won’t talk about, such as the Economic Census and the Census of Governments, each conducted every five years.

Geographies

Census data is collected at and aggregated to various levels:

  • The country as a whole
  • States / territories
  • Counties
  • ZIP Code Tabulation Areas (approximations of ZIP Codes)
  • Urban areas
  • Census Tracts (1.2k - 8k people)
  • Census Block Groups (600 - 3k people)
  • Census Blocks (the smallest unit)
  • and probably more I’ve forgotten about!

Census Website

The website of the Census Bureau (https://www.census.gov) is a veritable treasure trove of information about what’s available and how to use Census data.

You can obtain data and download it in the Census Data browser at https://data.census.gov/. The tables you will find here are optimized for human readability, not always for processing via script.

Practical Section 2

Let’s take a look at these two websites. Delve into what’s available. How would this data be useful for you in your clinical practice or research?

API Overview

Plan to work with Census Bureau data over and over again? It’s worth the time to use APIs instead of downloading data from the website manually.

This is what the Census Bureau says about API usage:

Any user may query small quantities of data with minimal restrictions (up to 50 variables in a single query, and up to 500 queries per IP address per day). However, more than 500 queries per IP address per day requires that you register for an API key.

Getting an API Key

From the same source:

Once you have an API key, you can extract information from Census Bureau data sets using a variety of tools including JSON, R, Python, or even by typing a query string into the URL of a Web browser.
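As a sketch of the “query string in a URL” approach, here’s how such a request URL might be assembled in Python. The variable code B01001_001E (the ACS total population estimate) and the geography clauses follow the patterns in the Census API documentation, but treat the exact parameters as illustrative, and note the key is a placeholder:

```python
from urllib.parse import urlencode

# Sketch of a Census API query URL (ACS 5-year, 2020 vintage).
# The geography codes below are Pennsylvania (42) and
# Philadelphia County (101), as discussed later in this deck.
base = "https://api.census.gov/data/2020/acs/acs5"
params = {
    "get": "NAME,B01001_001E",     # place name + total population estimate
    "for": "tract:*",              # every tract...
    "in": "state:42 county:101",   # ...within Philadelphia
    "key": "YOUR_API_KEY_HERE",    # placeholder -- use your own key
}
query_url = base + "?" + urlencode(params)

# Requesting query_url (e.g. with urllib.request) returns JSON:
# a list of rows, with the first row holding the column names.
```

You could paste the resulting URL straight into a browser, which is a handy way to debug a query before wiring it into a script.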

Practical Section 3

The Census Bureau offers free API credentials at https://api.census.gov/data/key_signup.html

Do that now.

We’ll wait.

No, really, do that now, that way you can work on the practical sections!

API Endpoints

Check out their list of API endpoints.

tidycensus is a package that helps you work with specific APIs offered by the Census Bureau.

Documentation

Great documentation in the ACS API Handbook

FIPS

“FIPS” stands for “Federal Information Processing Standards” but often, when you talk to people, they’ll apply the term to whatever their particular federal data is… so, e.g., instead of “Census tract identifier” they’ll say “the FIPS”. It’s a term that therefore ends up having lots of meanings.

There are FIPS codes for states, counties, tracts, and blocks; when concatenated, they form a single geographic ID. Tracts and blocks can and will change from census to census!

FIPS Example

For example, the state code for Pennsylvania is 42, the county code for Philadelphia is 101, and the census tract within Philadelphia where the University City campus of the Children’s Hospital of Philadelphia stands is 036901 (the last two digits can be thought of as ‘after the decimal point’, so this has a “human” name of Census Tract 369.01). Further, the block group is 4, and the full block number is 4002, so you might be using a “GEOID” of 421010369014002 (if the block is included), or just 42101036901 (if you have tract level data only).
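That positional concatenation is easy to take apart in code: a 15-character block-level GEOID splits as 2 digits (state) + 3 (county) + 6 (tract) + 4 (block), with the block group being the first digit of the block. A minimal sketch in Python, using the CHOP example above:

```python
def split_geoid(geoid: str) -> dict:
    """Split a Census GEOID into its FIPS components by position:
    2 digits of state + 3 of county + 6 of tract [+ 4 of block]."""
    parts = {
        "state": geoid[0:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
    }
    if len(geoid) == 15:                  # block-level GEOID
        parts["block"] = geoid[11:15]
        parts["block_group"] = geoid[11]  # first digit of the block
    return parts

# The CHOP University City example from above:
chop = split_geoid("421010369014002")
# -> state "42", county "101", tract "036901", block "4002"
```

The same slicing works in reverse: paste the pieces together to build a GEOID for joining tabular data to map geometries.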

Granularity of Data

Census data is very very specific. If, for example, you’re interested in income data for a given tract, you might find columns that include descriptions like:

  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Total households - Less than $10,000
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Total households - $10,000 to $14,999
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Total households - $15,000 to $24,999
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Total households - $25,000 to $34,999
  • … and so on ..

Granularity of Data

Or:

  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Less than $10,000
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - $10,000 to $14,999
  • … and so on …

Granularity of Data

Or:

  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - With Supplemental Security Income - Mean Supplemental Security Income (dollars)
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - With cash public assistance income - Mean cash public assistance income (dollars)
  • … and so on…

Granularity of Data

Or:

  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Median earnings for workers (dollars)
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Median earnings for male full-time, year-round workers (dollars)
  • INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Median earnings for female full-time, year-round workers (dollars)

Granularity of Data

You will likely need to hone your question a bit: families only, or all households (say, a single person, or a group home)? Do you want to look at statistics across the board, or broken out by race, sex, or Hispanic origin? What is considered income, and what counts as benefits? Do you want to include SSI? Measure it separately? What about welfare?

Estimates and MOEs

You’ll also find, for any given measure, a few variables related to it:

  • Estimate – used when a scalar count or value is needed, like median income or number of white women
  • Margin of error – used to indicate the precision of the estimate
  • Percent – used when a percent is needed, like percent of families below the poverty line
  • Percent Margin of Error – used to indicate the precision of the percent estimate

Note that all four columns are generally present although only two make sense for any given measure!
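One practical note you can build on: ACS margins of error are published at the 90% confidence level, so the confidence interval and standard error fall out with simple arithmetic. A small sketch in Python:

```python
# ACS margins of error are published at the 90% confidence level, so the
# interval is simply estimate +/- MOE, and the standard error can be
# recovered by dividing the MOE by 1.645 (the 90% z-score).
def acs_interval(estimate: float, moe: float):
    """Return (low, high, standard_error) for an ACS estimate."""
    return estimate - moe, estimate + moe, moe / 1.645

# e.g. a median household income estimate of $42,000 with an MOE of $3,290
low, high, se = acs_interval(42_000, 3_290)
# -> interval ($38,710, $45,290), standard error $2,000
```

Checking the MOE relative to the estimate is a quick sanity check before mapping tract-level values: small tracts can have MOEs nearly as large as the estimates themselves.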

Sparsity

Every area of the US belongs to a census tract, even if it’s an area in which people don’t normally live (like a park or lake or airport). That’s why you might see census tracts with little to no data. Don’t panic if you see that a few tracts have very sparse data – they may be one of these special tracts.

Practical Section 4

Time to play with data! But first, check your email, then go to the materials for this course, the stuff you downloaded from the repository.

  • Find the email with your Census API key
  • Add that to a new text file (you can make a new text file in RStudio or any text editor)
  • Save it in the /private directory you created earlier, as census_api_key.txt

Practical Section 4

Now that you’ve stored your API key…

Go to the materials for this course, the stuff you downloaded from the repository. Look inside /scripts and open census_data.Rmd.

(Feeling fancy? It might be a good idea to start a Project using the top level directory as the location…)

Practical Section 4

You probably have to install some things. RStudio may have already alerted you to this. Alternatively, uncomment and run line 15 of census_data.Rmd. You can skip most of the verbiage in the next few sections; it’s the stuff I’ve already explained.

Scroll all the way down to the chunk titled “key_setting” around line 140, where we’ll pick up your census key. You did save it, right? Then, run the blocks one at a time, pausing to note what each one does and how you might choose something different as an example. Experiment some! This will lead into our Break / Q&A time, so govern your time as you see fit.

Section 3: Public Data Portals

In this section we’ll delve a bit more into what APIs are and why to use them, and give you some tips for navigating public data portals.

Not all public data of interest is located in well-organized portals – you may need to scrape an HTML table from a webpage or download a fixed-width file from an FTP server, but today we’re going to stick to these use cases.

Public Data Portal

The easiest way to find public data that’s relevant to you is to search for “Open Data” plus your search term, and look for sites that have “data” in the URL or name.

For example, let’s search for “open data gun violence new york city”. A few results show up:

Public Data Portal

The first site has NYC gun violence data in the form of tables inside a PDF (not that helpful for us). The second site, which has “data” in the URL, is much more promising…

NYC Open Data

NYC Open Data is a particularly good open data resource. Still, it’s not perfect! The NYC Crime dataset, for example, contains only 11 rows!

But it comes from a much larger dataset, the misnamed “NYPD Complaint Data Current (Year to Date)” which actually has crime data from 2019 forward.

NYC Open Data

Let’s take a peek at the API button:

Socrata API

Because the Socrata Open Data API (SODA) is consistent across the many public data sources that employ it, we can learn some of the basic use cases once and be well-equipped to use the same methods in multiple places.

The Socrata Open Data API (SODA) uses URL query strings (also known as URL queries or URL parameters) to pass the data provider some details about what data you want.

Aside: Why is this API called “open”? Because it allows any user to download data without login credentials – it’s, well, open!

URL Queries

Consider this URL: https://www.amazon.com/s?k=r+for+data+science&crid=1CZ68952YCOJU&sprefix=r+for+data+sci%2Caps%2C143&ref=nb_sb_ss_i_1_14

You may have seen long URLs like this one, which have question marks, equals signs, and ampersands. These long query strings generally give specific data – in this case, I’m asking for a specific book title, which I left in lower case: “r for data science”. Let’s take a look at this specific query string:

?k=r+for+data+science&crid=1CZ68952YCOJU&sprefix=r+for+data+sci%2Caps%2C143&ref=nb_sb_ss_i_1_14

Query Strings

?k=r+for+data+science&crid=1CZ68952YCOJU&sprefix=r+for+data+sci%2Caps%2C143&ref=nb_sb_ss_i_1_14

These are the keys (variables, named data points) and values we see in the query string:

  • “k”, which is equal to “r+for+data+science”
  • “crid” (maybe my customer ID?): “1CZ68952YCOJU”
  • “sprefix” (seems to reiterate the book title and a few other things): “r+for+data+sci%2Caps%2C143”
  • “ref”, which may be some code about my search history or how I got to this page: “nb_sb_ss_i_1_14”

Query Strings

?k=r+for+data+science&crid=1CZ68952YCOJU&sprefix=r+for+data+sci%2Caps%2C143&ref=nb_sb_ss_i_1_14

You’ll notice that a query string starts with a ? and is followed by key-value pairs with the format “key=value”. There are no spaces allowed, which is why URLs will use things like plus signs or %20 to indicate spaces. Between key-value pairs, we add an ampersand (&), and can string together many key-value pairs in this way.

It’s important to become comfortable with query strings like the one above so that you can effectively construct query strings for your work with the Socrata API.
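One way to get comfortable is to let code do the unpacking. Python’s standard library parses query strings directly; here’s the Amazon query string from above:

```python
from urllib.parse import parse_qs

query = ("k=r+for+data+science&crid=1CZ68952YCOJU"
         "&sprefix=r+for+data+sci%2Caps%2C143&ref=nb_sb_ss_i_1_14")

# parse_qs splits on '&' and '=', decodes '+' as a space and %2C as a
# comma, and maps each key to a list of values (keys can repeat).
params = parse_qs(query)
# -> params["k"] is ["r for data science"]
```

Going the other direction, `urllib.parse.urlencode` turns a dictionary of keys and values back into a properly escaped query string, which is handy when constructing API requests.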

Socrata Tips and Tricks

The SODA API will, unless we state otherwise, only give us the first 1000 rows of data, to keep us from accidentally downloading millions of rows.

SODA’s “Simple Filter” functionality provides coarse-grained control, allowing you to limit what you import based on column values.

Read more about that here: https://dev.socrata.com/docs/filtering.html
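Putting those pieces together, a SODA request URL with a row limit and a simple filter might be assembled like this. The dataset identifier (“abcd-1234”) and the column name boro_nm are placeholders for illustration, not a real dataset:

```python
from urllib.parse import urlencode

# SODA endpoints look like https://<domain>/resource/<dataset-id>.json;
# the dataset id here ("abcd-1234") is a made-up placeholder.
base = "https://data.cityofnewyork.us/resource/abcd-1234.json"
params = {
    "$limit": 5000,         # override the default 1000-row cap
    "boro_nm": "BROOKLYN",  # a "simple filter": column name = value
}
soda_url = base + "?" + urlencode(params)
```

Swapping the `.json` extension for `.csv` on Socrata endpoints generally changes the response format, which can be convenient for quick spreadsheet-style checks.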

Section 5: Mapping With Leaflet

In this section, we’re going to talk about two topics:

  • Maps as a data visualization idiom
  • Choropleths

Map as Idiom: Example 1

What do you see in this reconstruction of a 12th century data visualization?

al-Idrisi’s Tabula Rogeriana (Kitab Rujar)

Map as Idiom: Example 2

modern map of Okinawa

Elements of Maps

  • Shapes: points, lines, polygons
  • Colors: hue (water is blue), intensity (e.g. water depth)
  • Sizes: thick / thin lines, large / small points, solid / broken lines
  • Language: scales, numbers, words

Map Files

  • GeoJSON (IETF Standard) https://tools.ietf.org/html/rfc7946 (usually one file)
  • Shapefile (ESRI standard) https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf (usually multiple files in a .zip)

Along with many folks (see, e.g., http://switchfromshapefile.org/), I believe that GeoJSON is a better format than Shapefile, but this is mostly due to the fact that JSON itself is so well understood and easy to work with, so it’s a simpler jump for me. Your needs may be different!
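Since GeoJSON is just JSON, any JSON parser can read it. Here’s a minimal FeatureCollection, parsed with Python’s standard library (note that GeoJSON coordinates are [longitude, latitude], per the spec):

```python
import json

# A minimal GeoJSON FeatureCollection: one point feature with a property.
# Because GeoJSON is plain JSON, the standard library reads it directly.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [-75.1652, 39.9526]},
      "properties": {"name": "Philadelphia"}
    }
  ]
}
"""
collection = json.loads(geojson_text)
first = collection["features"][0]
# -> a Point geometry plus a "properties" dict of attribute data
```

The `properties` object is where tabular data (say, a Census estimate per tract) gets attached to each geometry, which is exactly what a choropleth needs.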

But Wait, There’s More!

There are other geospatial data types with smaller market share:

Working With Map Files

Let’s just start working with some map files to see what they look like under the hood.
All of my files can be found at https://github.com/pm0kjp/mapping-geographic-data-in-r or https://rstudio.cloud/project/334226 (YMMV)